The aim of this notebook is to use a word2vec model to find similar songs

In [1]:
import pandas as pd
import numpy as np
import gensim.models.word2vec as w2v
import multiprocessing
import os
import re
import pprint
import sklearn.manifold
import matplotlib.pyplot as plt
In [2]:
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"
In [3]:
# To plot in Colab:

def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
# call this function in every cell where you make a plot

Though non-English artists were removed, the dataset still contained Hindi lyrics by Lata Mangeshkar transliterated into Latin script. Therefore, I decided to remove all songs sung by her.
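Filtering by artist works here because the offending songs all share one artist. For catching other transliterated lyrics, a rough heuristic is to check for common English function words, which transliterated text in Latin script almost never contains. A minimal sketch (the `looks_english` helper and its threshold are my assumptions, not part of the notebook):

```python
def looks_english(text, threshold=0.05):
    # Transliterated lyrics pass ASCII checks but rarely contain
    # English function words; flag texts where such words are scarce.
    common = frozenset({"the", "and", "you", "to", "a", "i", "it", "my", "in", "of"})
    words = text.lower().split()
    if not words:
        return False
    hits = sum(w in common for w in words)
    return hits / len(words) >= threshold

print(looks_english("you and i dancing in the rain"))   # True
print(looks_english("mera dil ye pukare aaja"))          # False
```

A proper language-detection library would be more reliable, but for a one-off cleanup a heuristic like this is often enough to surface suspicious artists for manual review.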

In [4]:
songs = pd.read_csv("songlyrics/songdata.csv", header=0)
#songs.head()
songs = songs[songs.artist != 'Lata Mangeshkar']
songs.head()
Out[4]:
artist song link text
0 ABBA Ahe's My Kind Of Girl /a/abba/ahes+my+kind+of+girl_20598417.html Look at her face, it's a wonderful face \nAnd...
1 ABBA Andante, Andante /a/abba/andante+andante_20002708.html Take it easy with me, please \nTouch me gentl...
2 ABBA As Good As New /a/abba/as+good+as+new_20003033.html I'll never know why I had to go \nWhy I had t...
3 ABBA Bang /a/abba/bang_20598415.html Making somebody happy is a question of give an...
4 ABBA Bang-A-Boomerang /a/abba/bang+a+boomerang_20002668.html Making somebody happy is a question of give an...

To train the word2vec model, we first need to build its vocabulary. To do that, I iterated over each song, split it into lowercase tokens, and appended the result to a list that can later be fed to the model.

In [5]:
text_corpus = []
for song in songs['text']:
    words = song.lower().split()
    text_corpus.append(words)
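Note that this whitespace-only split keeps punctuation attached to words, which is why tokens like `love,` and `love.` later show up as separate vocabulary entries. A sketch of a punctuation-stripping tokenizer that would merge them (the `tokenize` helper is illustrative, not the notebook's code):

```python
import re

def tokenize(song_text):
    # Lowercase, then replace everything except letters, digits,
    # apostrophes and whitespace with a space, so "love," and "love."
    # map to the same token as "love".
    cleaned = re.sub(r"[^a-z0-9'\s]", " ", song_text.lower())
    return cleaned.split()

print(tokenize("Look at her face, it's a wonderful face!"))
```

Using this in place of the plain `split()` shrinks the vocabulary and tends to sharpen the `most_similar` results seen below.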



# Dimensionality of the resulting word vectors.
# More dimensions are more computationally expensive to train,
# but can capture finer-grained semantics.
num_features = 50

# Minimum word count threshold; 1 keeps every word.
min_word_count = 1

# Number of threads to run in parallel: more workers, faster training.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsampling threshold for frequent words.
downsampling = 1e-1

# Seed for the random number generator, to make the results
# reproducible (deterministic runs are good for debugging).
seed = 1

songs2vec = w2v.Word2Vec(
    sg=1,  # skip-gram
    seed=seed,
    workers=num_workers,
    size=num_features,  # renamed to `vector_size` in gensim >= 4.0
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

songs2vec.build_vocab(text_corpus)
print(len(text_corpus))
57618
In [6]:
import time
start_time = time.time()



songs2vec.train(text_corpus, total_examples=songs2vec.corpus_count, epochs=2)

if not os.path.exists("trained"):
    os.makedirs("trained")

songs2vec.save(os.path.join("trained", "songs2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))
--- 50.78345012664795 seconds ---
In [7]:
songs2vec = w2v.Word2Vec.load(os.path.join("trained", "songs2vectors.w2v"))

Let's explore our model

Find similar words

In [8]:
songs2vec.wv.most_similar("love")
Out[8]:
[('love,', 0.9272359013557434),
 ('love.', 0.9169631004333496),
 ('love...', 0.8991478681564331),
 ('desire.', 0.8820338249206543),
 ('apart)', 0.8780913352966309),
 ('hell)', 0.8778669834136963),
 ('loving', 0.8776887655258179),
 ('affair,', 0.8763388395309448),
 ('share.', 0.8751762509346008),
 ('too)', 0.8748354911804199)]
In [9]:
songs2vec.wv.most_similar("fuck")
Out[9]:
[('haters', 0.9017799496650696),
 ('fuck,', 0.8963127136230469),
 ("fuckin'", 0.892696738243103),
 ('yall', 0.8908715844154358),
 ('diss', 0.8907990455627441),
 ('homie', 0.8905448913574219),
 ("y'all", 0.8822363615036011),
 ('nigga', 0.8762768507003784),
 ('bitch!', 0.8723354339599609),
 ('nigga,', 0.8682982921600342)]
In [10]:
songs2vec.wv.most_similar("coffee")
Out[10]:
[('tea', 0.9271487593650818),
 ('breakfast', 0.9034494757652283),
 ('cup', 0.8701494336128235),
 ('gin', 0.8694416284561157),
 ('liquor', 0.8655227422714233),
 ('pills', 0.8640533089637756),
 ('beer', 0.8590098023414612),
 ('bottle', 0.85589200258255),
 ('champagne', 0.8478769063949585),
 ('pint', 0.8457052707672119)]
In [11]:
songs2vec.wv.most_similar("espresso")
Out[11]:
[('ref:', 0.9786605834960938),
 ('microchips', 0.9774864912033081),
 ('callin,', 0.9772010445594788),
 ('females,', 0.9771894216537476),
 ('mohamed', 0.9771457314491272),
 ('mildew', 0.9768126010894775),
 ('occupy.', 0.9767889976501465),
 ('4-5', 0.9765291213989258),
 ('awkward,', 0.9759055376052856),
 ('arrives.', 0.9756759405136108)]

Words out of context

In [12]:
songs2vec.wv.doesnt_match("happiness love joy hate".split())
Out[12]:
'hate'
In [13]:
songs2vec.wv.doesnt_match("breakfast milk lunch dinner".split())
Out[13]:
'milk'
In [14]:
songs2vec.wv.doesnt_match("morning evening night sunday".split())
Out[14]:
'sunday'
In [15]:
songs2vec.wv.doesnt_match("high low jump".split())
Out[15]:
'jump'
In [16]:
songs2vec.wv.most_similar(positive=['woman', 'king'], negative=['man'])
#queen
Out[16]:
[('alleluia,', 0.8264390230178833),
 ('jesus', 0.8153643012046814),
 ('christ,', 0.8135936260223389),
 ('glory', 0.7958534955978394),
 ('king.', 0.793578028678894),
 ('gift', 0.7839601635932922),
 ('lord', 0.7798833847045898),
 ('birth.', 0.7764042019844055),
 ('savior', 0.7760918140411377),
 ('glorious', 0.7760848999023438)]
In [17]:
songs2vec.wv.most_similar(positive=['gin', 'whiskey'], negative=['eggs'])
Out[17]:
[('whiskey,', 0.8293017148971558),
 ('drink', 0.801476776599884),
 ('drinking', 0.775546669960022),
 ("drinkin'", 0.7749837636947632),
 ('bottle', 0.7738522291183472),
 ('rye', 0.7691442966461182),
 ('wine', 0.7619420289993286),
 ('tea', 0.7555461525917053),
 ('tequila', 0.7458945512771606),
 ('wine,', 0.7443379759788513)]

Semantic distance between words

In [18]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = songs2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{0} is to {1} what {2} is to {3}".format(start1, end1, start2, end2))
In [19]:
nearest_similarity_cosmul("paris", "france", "alabama")
paris is to france what georgia is to alabama
In [20]:
nearest_similarity_cosmul("paris", "france", "london")
paris is to france what london, is to london
In [21]:
nearest_similarity_cosmul("happy", "sad", "alone")
happy is to sad what alone, is to alone
In [22]:
nearest_similarity_cosmul("near", "far", "london")
near is to far what down is to london

With the word vector embeddings in place, it is now time to compute the normalised vector sum of each song. This can take a while, since it has to be done for each of the ~57,000 songs.

In [23]:
import sklearn.preprocessing

print(songs2vec.wv['un-right'])

def songVector(row):
    vector_sum = 0
    words = row.lower().split()
    for word in words:
        vector_sum = vector_sum + songs2vec.wv[word]
    vector_sum = vector_sum.reshape(1, -1)
    normalised_vector_sum = sklearn.preprocessing.normalize(vector_sum)
    return normalised_vector_sum


import time
start_time = time.time()

songs['song_vector'] = songs['text'].apply(songVector)
[-0.03835382  0.00890708 -0.02956764 -0.00720456  0.0171508   0.0843273
 -0.04811256 -0.0590744   0.02842597  0.02928108 -0.04603193 -0.00371034
  0.05254248  0.01675174  0.02645272 -0.01925546 -0.00163119  0.05740239
 -0.04140217  0.01041649  0.02549906  0.04465934 -0.06834672  0.00298804
 -0.02285406 -0.00271651 -0.07178975 -0.00360564  0.03512111  0.00865354
 -0.02098365 -0.05662556  0.03978603  0.01826817  0.02564787 -0.03824577
 -0.09348247 -0.04387096  0.10630674  0.01325714  0.01871802  0.07183068
  0.06937303 -0.05744873 -0.02069402 -0.01954402  0.02351294 -0.0135747
 -0.0104466   0.0198248 ]
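The loop above sums the raw word vectors before normalising. An equivalent and slightly safer formulation averages them and skips out-of-vocabulary words, which would otherwise raise a `KeyError` when a word is missing from the model. A sketch under those assumptions (the `song_vector` helper and the toy dictionary are mine, standing in for `songs2vec.wv`):

```python
import numpy as np

def song_vector(text, word_vectors, dim=50):
    # Average the vectors of all known words, then L2-normalise.
    # Normalising the mean gives the same direction as normalising
    # the sum, but the mean is easier to reason about.
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros((1, dim))
    v = np.mean(vecs, axis=0).reshape(1, -1)
    return v / np.linalg.norm(v)

# Toy vocabulary standing in for the trained model.
toy = {"love": np.array([1.0, 0.0]), "song": np.array([0.0, 1.0])}
print(song_vector("love song", toy, dim=2))
```

With `min_count=1` every corpus word is in the vocabulary, so the guard never fires here, but it matters as soon as `min_count` is raised.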

t-SNE and random song selection

Each song vector has 50 dimensions. Applying t-SNE is memory-intensive, so it is easier on the machine to work with a random sample of the ~57,000 songs.

In [24]:
song_vectors = []
from sklearn.model_selection import train_test_split

train, test = train_test_split(songs, test_size = 0.9)


for song_vector in train['song_vector']:
    song_vectors.append(song_vector)

train.head(10)
Out[24]:
artist song link text song_vector
8860 J Cole Disgusting /j/j+cole/disgusting_20910369.html Can't help but think about it all the time. \... [[-0.25725365, 0.07351082, -0.1043328, -0.0525...
21825 Xscape One Of Those Love Songs /x/xscape/one+of+those+love+songs_20147669.html When you're far from me there's a melody \nTh... [[-0.28554463, 0.07458883, -0.04793184, -0.022...
40488 Kirk Franklin Better /k/kirk+franklin/better_20370473.html If I could I, I'd get away \nFar from all thi... [[-0.23288684, 0.040610936, -0.050898734, -0.0...
7875 HIM Fade Into You /h/him/fade+into+you_20626473.html I want to hold the hand inside you \nI want t... [[-0.27420214, 0.06918051, -0.009710236, -0.02...
49323 Queen I'm In Love With My Car /q/queen/im+in+love+with+my+car_20112603.html Oooh \nThe machine of a dream, such a clean m... [[-0.280056, 0.05401318, -0.07008333, -0.03232...
13302 Morrissey Human Being /m/morrissey/human+being_20823815.html Uno due tre \nWell, if you don't like it \nG... [[-0.21847965, 0.043608855, -0.05580177, -0.02...
44873 Natalie Cole Say You Love Me /n/natalie+cole/say+you+love+me_20098208.html Say you love me, say I'm the one your eyes see... [[-0.2411642, 0.066699274, -0.072465226, -0.05...
28605 Cyndi Lauper You Don't Know /c/cyndi+lauper/you+dont+know_20035180.html You don't know where you belong \nYou should ... [[-0.25898057, 0.06640644, -0.043958418, -0.03...
1966 Britney Spears Hold It Against Me /b/britney+spears/hold+it+against+me_20899715.... Hey, over there \nPlease, forgive me \nIf I'... [[-0.24872817, 0.05712086, -0.07512119, -0.051...
40095 Kenny Rogers Loving Armes /k/kenny+rogers/loving+armes_20812125.html If you could see me now \nThe one who said \... [[-0.25795636, 0.05717155, -0.011932116, -0.03...

I had a fairly measly 4 GB machine and wasn't able to train a more accurate model. However, one can play with the number of iterations, the learning rate and other factors to fit the model better. If you have many dimensions (~300+), it may make sense to apply PCA first and then t-SNE.
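The PCA-then-t-SNE pipeline mentioned above can be sketched as follows. The data here is random stand-in for 300-dimensional song vectors; in this notebook the vectors are only 50-dimensional, so the PCA step is unnecessary:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in data: 200 "songs" with 300-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))

# Step 1: PCA cheaply reduces to ~50 components, discarding
# low-variance directions that mostly add noise to t-SNE.
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)

# Step 2: t-SNE on the reduced matrix is faster and more stable.
X_2d = TSNE(n_components=2, random_state=0, init="pca").fit_transform(X_reduced)
print(X_2d.shape)  # (200, 2)
```

This is the approach recommended in scikit-learn's own t-SNE documentation for high-dimensional dense data.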

In [25]:
X = np.array(song_vectors).reshape((-1, 50))  # one 50-dim row per sampled song

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=250, random_state=0, verbose=2)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5761 samples in 0.009s...
[t-SNE] Computed neighbors for 5761 samples in 4.928s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5761
[t-SNE] Computed conditional probabilities for sample 2000 / 5761
[t-SNE] Computed conditional probabilities for sample 3000 / 5761
[t-SNE] Computed conditional probabilities for sample 4000 / 5761
[t-SNE] Computed conditional probabilities for sample 5000 / 5761
[t-SNE] Computed conditional probabilities for sample 5761 / 5761
[t-SNE] Mean sigma: 0.040309
[t-SNE] Computed conditional probabilities in 0.269s
[t-SNE] Iteration 50: error = 87.6417007, gradient norm = 0.0217770 (50 iterations in 10.645s)
[t-SNE] Iteration 100: error = 87.7263947, gradient norm = 0.0167939 (50 iterations in 6.869s)
[t-SNE] Iteration 150: error = 86.7498779, gradient norm = 0.0436074 (50 iterations in 5.224s)
[t-SNE] Iteration 200: error = 86.9479675, gradient norm = 0.0168996 (50 iterations in 4.614s)
[t-SNE] Iteration 250: error = 86.9554672, gradient norm = 0.0227164 (50 iterations in 4.552s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 86.955467
[t-SNE] KL divergence after 251 iterations: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
--- 37.116883993148804 seconds ---
In [26]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)

Joining two dataframes to obtain each song's corresponding X,Y co-ordinate.

In [27]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()
Out[27]:
artist song link text song_vector X Y
0 J Cole Disgusting /j/j+cole/disgusting_20910369.html Can't help but think about it all the time. \... [[-0.25725365, 0.07351082, -0.1043328, -0.0525... 0.005373 0.123836
1 Xscape One Of Those Love Songs /x/xscape/one+of+those+love+songs_20147669.html When you're far from me there's a melody \nTh... [[-0.28554463, 0.07458883, -0.04793184, -0.022... -0.001371 -0.145588
2 Kirk Franklin Better /k/kirk+franklin/better_20370473.html If I could I, I'd get away \nFar from all thi... [[-0.23288684, 0.040610936, -0.050898734, -0.0... -0.000319 0.214653
3 HIM Fade Into You /h/him/fade+into+you_20626473.html I want to hold the hand inside you \nI want t... [[-0.27420214, 0.06918051, -0.009710236, -0.02... -0.001136 0.089799
4 Queen I'm In Love With My Car /q/queen/im+in+love+with+my+car_20112603.html Oooh \nThe machine of a dream, such a clean m... [[-0.280056, 0.05401318, -0.07008333, -0.03232... 0.001489 -0.061845

Plotting the results

Using plotly, I plotted the results so that similar songs become easier to explore through their colours and clusters.

In [28]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_songs['Y'],
    x = two_dimensional_songs['X'],
    text = two_dimensional_songs['song'],
    mode='markers',
    marker=dict(
        size=5,
        color=np.random.randn(len(two_dimensional_songs)),  # one random colour value per point
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]

iplot(data)
In [29]:
import plotly.express as px

df = px.data.iris()
print(df)
print(type(df))
#fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
#              color='species')
#fig.show()
sepal_length  sepal_width  petal_length  petal_width    species  \
0             5.1          3.5           1.4          0.2     setosa   
1             4.9          3.0           1.4          0.2     setosa   
2             4.7          3.2           1.3          0.2     setosa   
3             4.6          3.1           1.5          0.2     setosa   
4             5.0          3.6           1.4          0.2     setosa   
..            ...          ...           ...          ...        ...   
145           6.7          3.0           5.2          2.3  virginica   
146           6.3          2.5           5.0          1.9  virginica   
147           6.5          3.0           5.2          2.0  virginica   
148           6.2          3.4           5.4          2.3  virginica   
149           5.9          3.0           5.1          1.8  virginica   

     species_id  
0             1  
1             1  
2             1  
3             1  
4             1  
..          ...  
145           3  
146           3  
147           3  
148           3  
149           3  

[150 rows x 6 columns]
<class 'pandas.core.frame.DataFrame'>
In [30]:
###plot cluster by ARTIST

print(type(two_dimensional_songs))

import plotly.express as px
fig = px.scatter(two_dimensional_songs, x='X', y='Y',color='artist')
fig.show()
<class 'pandas.core.frame.DataFrame'>
In [31]:
import plotly.graph_objects as go
import numpy as np

fig = go.Figure(data=go.Scatter(
    y = two_dimensional_songs['Y'],
    x = two_dimensional_songs['X'],
    text = two_dimensional_songs['artist']+ "_"+two_dimensional_songs['song'] ,
    mode='markers',
    marker=dict(
        size=10,
        color=np.random.randn(len(two_dimensional_songs)),  # one random colour value per point
        colorscale='Viridis',
        showscale=True
    )
))

fig.show()
In [32]:
## LOOK FOR COMMON SONGS AND ANALYZE THE TEXT
In [33]:
import plotly.graph_objects as go
import numpy as np

fig = go.Figure(data=go.Scatter(
    y = np.random.randn(500),
    mode='markers',
    marker=dict(
        size=16,
        color=np.random.randn(500), #set color equal to a variable
        colorscale='Viridis', # one of plotly colorscales
        showscale=True
    )
))

fig.show()
In [34]:
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
              color='species')
fig.show()

Let's play with the model's parameter values

In [31]:
X = np.array(song_vectors).reshape((-1, 50))

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=2000, random_state=0, verbose=2, learning_rate=1000)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5761 samples in 0.009s...
[t-SNE] Computed neighbors for 5761 samples in 4.647s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5761
[t-SNE] Computed conditional probabilities for sample 2000 / 5761
[t-SNE] Computed conditional probabilities for sample 3000 / 5761
[t-SNE] Computed conditional probabilities for sample 4000 / 5761
[t-SNE] Computed conditional probabilities for sample 5000 / 5761
[t-SNE] Computed conditional probabilities for sample 5761 / 5761
[t-SNE] Mean sigma: 0.040309
[t-SNE] Computed conditional probabilities in 0.275s
[t-SNE] Iteration 50: error = 97.5246582, gradient norm = 0.1373755 (50 iterations in 8.239s)
[t-SNE] Iteration 100: error = 98.1395569, gradient norm = 0.1257151 (50 iterations in 6.428s)
[t-SNE] Iteration 150: error = 99.1509323, gradient norm = 0.1189510 (50 iterations in 6.902s)
[t-SNE] Iteration 200: error = 99.3906021, gradient norm = 0.1162611 (50 iterations in 5.996s)
[t-SNE] Iteration 250: error = 99.1292801, gradient norm = 0.1193485 (50 iterations in 5.982s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 99.129280
[t-SNE] Iteration 300: error = 3.0610936, gradient norm = 0.0015430 (50 iterations in 4.100s)
[t-SNE] Iteration 350: error = 2.8837545, gradient norm = 0.0007704 (50 iterations in 4.947s)
[t-SNE] Iteration 400: error = 2.8097837, gradient norm = 0.0003770 (50 iterations in 3.257s)
[t-SNE] Iteration 450: error = 2.7790341, gradient norm = 0.0001193 (50 iterations in 3.123s)
[t-SNE] Iteration 500: error = 2.7583516, gradient norm = 0.0000942 (50 iterations in 3.160s)
[t-SNE] Iteration 550: error = 2.7445662, gradient norm = 0.0000804 (50 iterations in 3.156s)
[t-SNE] Iteration 600: error = 2.7352762, gradient norm = 0.0000646 (50 iterations in 3.161s)
[t-SNE] Iteration 650: error = 2.7284868, gradient norm = 0.0000563 (50 iterations in 3.262s)
[t-SNE] Iteration 700: error = 2.7232659, gradient norm = 0.0000503 (50 iterations in 3.267s)
[t-SNE] Iteration 750: error = 2.7193623, gradient norm = 0.0000486 (50 iterations in 3.204s)
[t-SNE] Iteration 800: error = 2.7165134, gradient norm = 0.0000460 (50 iterations in 3.237s)
[t-SNE] Iteration 850: error = 2.7139087, gradient norm = 0.0000482 (50 iterations in 3.298s)
[t-SNE] Iteration 900: error = 2.7117136, gradient norm = 0.0000371 (50 iterations in 3.270s)
[t-SNE] Iteration 950: error = 2.7095304, gradient norm = 0.0000341 (50 iterations in 3.278s)
[t-SNE] Iteration 1000: error = 2.7075562, gradient norm = 0.0000311 (50 iterations in 3.195s)
[t-SNE] Iteration 1050: error = 2.7056293, gradient norm = 0.0000308 (50 iterations in 3.157s)
[t-SNE] Iteration 1100: error = 2.7039552, gradient norm = 0.0000311 (50 iterations in 3.161s)
[t-SNE] Iteration 1150: error = 2.7024937, gradient norm = 0.0000282 (50 iterations in 3.211s)
[t-SNE] Iteration 1200: error = 2.7012210, gradient norm = 0.0000285 (50 iterations in 3.163s)
[t-SNE] Iteration 1250: error = 2.7001784, gradient norm = 0.0000293 (50 iterations in 3.171s)
[t-SNE] Iteration 1300: error = 2.6992404, gradient norm = 0.0000250 (50 iterations in 3.166s)
[t-SNE] Iteration 1350: error = 2.6981442, gradient norm = 0.0000262 (50 iterations in 3.175s)
[t-SNE] Iteration 1400: error = 2.6972899, gradient norm = 0.0000237 (50 iterations in 3.178s)
[t-SNE] Iteration 1450: error = 2.6965320, gradient norm = 0.0000238 (50 iterations in 3.170s)
[t-SNE] Iteration 1500: error = 2.6959705, gradient norm = 0.0000256 (50 iterations in 3.338s)
[t-SNE] Iteration 1550: error = 2.6954226, gradient norm = 0.0000226 (50 iterations in 3.352s)
[t-SNE] Iteration 1600: error = 2.6948814, gradient norm = 0.0000209 (50 iterations in 3.328s)
[t-SNE] Iteration 1650: error = 2.6943386, gradient norm = 0.0000224 (50 iterations in 3.436s)
[t-SNE] Iteration 1700: error = 2.6939495, gradient norm = 0.0000192 (50 iterations in 3.313s)
[t-SNE] Iteration 1750: error = 2.6934733, gradient norm = 0.0000225 (50 iterations in 3.197s)
[t-SNE] Iteration 1800: error = 2.6932263, gradient norm = 0.0000209 (50 iterations in 3.246s)
[t-SNE] Iteration 1850: error = 2.6928627, gradient norm = 0.0000189 (50 iterations in 3.618s)
[t-SNE] Iteration 1900: error = 2.6923983, gradient norm = 0.0000189 (50 iterations in 3.285s)
[t-SNE] Iteration 1950: error = 2.6920607, gradient norm = 0.0000181 (50 iterations in 3.218s)
[t-SNE] Iteration 2000: error = 2.6916862, gradient norm = 0.0000187 (50 iterations in 3.211s)
[t-SNE] KL divergence after 2000 iterations: 2.691686
--- 154.49582600593567 seconds ---
In [32]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
In [33]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()
Out[33]:
artist song link text song_vector X Y
0 J Cole Disgusting /j/j+cole/disgusting_20910369.html Can't help but think about it all the time. \... [[-0.25725365, 0.07351082, -0.1043328, -0.0525... -3.864289 -61.762867
1 Xscape One Of Those Love Songs /x/xscape/one+of+those+love+songs_20147669.html When you're far from me there's a melody \nTh... [[-0.28554463, 0.07458883, -0.04793184, -0.022... 25.786121 37.638924
2 Kirk Franklin Better /k/kirk+franklin/better_20370473.html If I could I, I'd get away \nFar from all thi... [[-0.23288684, 0.040610936, -0.050898734, -0.0... -41.167305 1.697515
3 HIM Fade Into You /h/him/fade+into+you_20626473.html I want to hold the hand inside you \nI want t... [[-0.27420214, 0.06918051, -0.009710236, -0.02... -32.818993 43.961792
4 Queen I'm In Love With My Car /q/queen/im+in+love+with+my+car_20112603.html Oooh \nThe machine of a dream, such a clean m... [[-0.280056, 0.05401318, -0.07008333, -0.03232... 10.199233 -45.278549
In [34]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_songs['Y'],
    x = two_dimensional_songs['X'],
    text = two_dimensional_songs['song'],
    mode='markers',
    marker=dict(
        size=5,
        color=np.random.randn(len(two_dimensional_songs)),  # one random colour value per point
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]

iplot(data)
In [35]:
###plot cluster by ARTIST

print(type(two_dimensional_songs))

import plotly.express as px
fig = px.scatter(two_dimensional_songs, x='X', y='Y',color='artist')
fig.show()
<class 'pandas.core.frame.DataFrame'>

The learning rate is too high; let's try lowering it

In [36]:
X = np.array(song_vectors).reshape((-1, 50))

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=2000, random_state=0, verbose=2, learning_rate=500)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5761 samples in 0.011s...
[t-SNE] Computed neighbors for 5761 samples in 4.976s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5761
[t-SNE] Computed conditional probabilities for sample 2000 / 5761
[t-SNE] Computed conditional probabilities for sample 3000 / 5761
[t-SNE] Computed conditional probabilities for sample 4000 / 5761
[t-SNE] Computed conditional probabilities for sample 5000 / 5761
[t-SNE] Computed conditional probabilities for sample 5761 / 5761
[t-SNE] Mean sigma: 0.040309
[t-SNE] Computed conditional probabilities in 0.275s
[t-SNE] Iteration 50: error = 90.1474686, gradient norm = 0.1114292 (50 iterations in 6.484s)
[t-SNE] Iteration 100: error = 90.0581131, gradient norm = 0.0953637 (50 iterations in 6.335s)
[t-SNE] Iteration 150: error = 90.2502289, gradient norm = 0.0938459 (50 iterations in 7.324s)
[t-SNE] Iteration 200: error = 90.5688171, gradient norm = 0.0886783 (50 iterations in 9.301s)
[t-SNE] Iteration 250: error = 89.8411484, gradient norm = 0.0951232 (50 iterations in 7.316s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 89.841148
[t-SNE] Iteration 300: error = 3.0233383, gradient norm = 0.0012260 (50 iterations in 3.949s)
[t-SNE] Iteration 350: error = 2.8826993, gradient norm = 0.0002807 (50 iterations in 3.170s)
[t-SNE] Iteration 400: error = 2.8152347, gradient norm = 0.0001641 (50 iterations in 3.200s)
[t-SNE] Iteration 450: error = 2.7791057, gradient norm = 0.0001160 (50 iterations in 3.153s)
[t-SNE] Iteration 500: error = 2.7569296, gradient norm = 0.0000894 (50 iterations in 3.167s)
[t-SNE] Iteration 550: error = 2.7425647, gradient norm = 0.0000744 (50 iterations in 3.201s)
[t-SNE] Iteration 600: error = 2.7326236, gradient norm = 0.0000630 (50 iterations in 3.205s)
[t-SNE] Iteration 650: error = 2.7253034, gradient norm = 0.0000724 (50 iterations in 3.194s)
[t-SNE] Iteration 700: error = 2.7195554, gradient norm = 0.0000469 (50 iterations in 3.197s)
[t-SNE] Iteration 750: error = 2.7146873, gradient norm = 0.0000433 (50 iterations in 3.196s)
[t-SNE] Iteration 800: error = 2.7109094, gradient norm = 0.0000395 (50 iterations in 3.180s)
[t-SNE] Iteration 850: error = 2.7077796, gradient norm = 0.0000351 (50 iterations in 3.306s)
[t-SNE] Iteration 900: error = 2.7053299, gradient norm = 0.0000364 (50 iterations in 3.349s)
[t-SNE] Iteration 950: error = 2.7034748, gradient norm = 0.0000364 (50 iterations in 3.238s)
[t-SNE] Iteration 1000: error = 2.7018118, gradient norm = 0.0000317 (50 iterations in 3.269s)
[t-SNE] Iteration 1050: error = 2.7003477, gradient norm = 0.0000325 (50 iterations in 3.226s)
[t-SNE] Iteration 1100: error = 2.6990511, gradient norm = 0.0000306 (50 iterations in 3.253s)
[t-SNE] Iteration 1150: error = 2.6978490, gradient norm = 0.0000268 (50 iterations in 3.245s)
[t-SNE] Iteration 1200: error = 2.6967578, gradient norm = 0.0000308 (50 iterations in 3.241s)
[t-SNE] Iteration 1250: error = 2.6958396, gradient norm = 0.0000265 (50 iterations in 3.248s)
[t-SNE] Iteration 1300: error = 2.6950550, gradient norm = 0.0000250 (50 iterations in 3.346s)
[t-SNE] Iteration 1350: error = 2.6941917, gradient norm = 0.0000255 (50 iterations in 3.381s)
[t-SNE] Iteration 1400: error = 2.6934628, gradient norm = 0.0000216 (50 iterations in 3.352s)
[t-SNE] Iteration 1450: error = 2.6926904, gradient norm = 0.0000219 (50 iterations in 3.280s)
[t-SNE] Iteration 1500: error = 2.6919670, gradient norm = 0.0000226 (50 iterations in 3.266s)
[t-SNE] Iteration 1550: error = 2.6913095, gradient norm = 0.0000220 (50 iterations in 3.440s)
[t-SNE] Iteration 1600: error = 2.6905828, gradient norm = 0.0000221 (50 iterations in 3.362s)
[t-SNE] Iteration 1650: error = 2.6900206, gradient norm = 0.0000232 (50 iterations in 3.423s)
[t-SNE] Iteration 1700: error = 2.6896429, gradient norm = 0.0000224 (50 iterations in 3.382s)
[t-SNE] Iteration 1750: error = 2.6893001, gradient norm = 0.0000237 (50 iterations in 3.281s)
[t-SNE] Iteration 1800: error = 2.6889610, gradient norm = 0.0000195 (50 iterations in 3.288s)
[t-SNE] Iteration 1850: error = 2.6885598, gradient norm = 0.0000196 (50 iterations in 3.312s)
[t-SNE] Iteration 1900: error = 2.6881874, gradient norm = 0.0000190 (50 iterations in 3.279s)
[t-SNE] Iteration 1950: error = 2.6877928, gradient norm = 0.0000188 (50 iterations in 3.286s)
[t-SNE] Iteration 2000: error = 2.6874516, gradient norm = 0.0000215 (50 iterations in 3.282s)
[t-SNE] KL divergence after 2000 iterations: 2.687452
--- 157.1789288520813 seconds ---
In [37]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
In [38]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()
Out[38]:
artist song link text song_vector X Y
0 J Cole Disgusting /j/j+cole/disgusting_20910369.html Can't help but think about it all the time. \... [[-0.25725365, 0.07351082, -0.1043328, -0.0525... 56.121349 12.880310
1 Xscape One Of Those Love Songs /x/xscape/one+of+those+love+songs_20147669.html When you're far from me there's a melody \nTh... [[-0.28554463, 0.07458883, -0.04793184, -0.022... -32.885647 -32.692928
2 Kirk Franklin Better /k/kirk+franklin/better_20370473.html If I could I, I'd get away \nFar from all thi... [[-0.23288684, 0.040610936, -0.050898734, -0.0... -16.512125 42.989262
3 HIM Fade Into You /h/him/fade+into+you_20626473.html I want to hold the hand inside you \nI want t... [[-0.27420214, 0.06918051, -0.009710236, -0.02... -40.136189 14.225302
4 Queen I'm In Love With My Car /q/queen/im+in+love+with+my+car_20112603.html Oooh \nThe machine of a dream, such a clean m... [[-0.280056, 0.05401318, -0.07008333, -0.03232... 42.373169 -2.230470
In [39]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_songs['Y'],
    x = two_dimensional_songs['X'],
    text = two_dimensional_songs['song'],
    mode='markers',
    marker=dict(
        size=5,
        color=np.random.randn(len(two_dimensional_songs)),  # one random colour value per point
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]

iplot(data)
In [40]:
###plot cluster by ARTIST

print(type(two_dimensional_songs))

import plotly.express as px
fig = px.scatter(two_dimensional_songs, x='X', y='Y',color='artist')
fig.show()
<class 'pandas.core.frame.DataFrame'>

Let's increase the number of iterations

In [42]:
X = np.array(song_vectors).reshape((-1, 50))

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=8000, random_state=0, verbose=2, learning_rate=400)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5761 samples in 0.009s...
[t-SNE] Computed neighbors for 5761 samples in 4.369s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5761
[t-SNE] Computed conditional probabilities for sample 2000 / 5761
[t-SNE] Computed conditional probabilities for sample 3000 / 5761
[t-SNE] Computed conditional probabilities for sample 4000 / 5761
[t-SNE] Computed conditional probabilities for sample 5000 / 5761
[t-SNE] Computed conditional probabilities for sample 5761 / 5761
[t-SNE] Mean sigma: 0.040309
[t-SNE] Computed conditional probabilities in 0.291s
[t-SNE] Iteration 50: error = 89.0030975, gradient norm = 0.0871931 (50 iterations in 7.812s)
[t-SNE] Iteration 100: error = 89.0293884, gradient norm = 0.0777942 (50 iterations in 6.629s)
[t-SNE] Iteration 150: error = 88.2886581, gradient norm = 0.0909523 (50 iterations in 6.060s)
[t-SNE] Iteration 200: error = 88.4030991, gradient norm = 0.0685605 (50 iterations in 7.449s)
[t-SNE] Iteration 250: error = 88.2871857, gradient norm = 0.0903342 (50 iterations in 5.456s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 88.287186
[t-SNE] Iteration 300: error = 3.0585337, gradient norm = 0.0016413 (50 iterations in 4.586s)
[t-SNE] ... (iterations 350-7950 trimmed; the error decreases steadily from 2.89 to 2.67) ...
[t-SNE] Iteration 8000: error = 2.6742406, gradient norm = 0.0000094 (50 iterations in 3.550s)
[t-SNE] KL divergence after 8000 iterations: 2.674241
--- 585.4843001365662 seconds ---
In [43]:
df=pd.DataFrame(all_word_vectors_matrix_2d,columns=['X','Y'])

df.head(10)

train.head()

df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
In [44]:
two_dimensional_songs = pd.concat([train, df], axis=1)

two_dimensional_songs.head()
Out[44]:
artist song link text song_vector X Y
0 J Cole Disgusting /j/j+cole/disgusting_20910369.html Can't help but think about it all the time. \... [[-0.25725365, 0.07351082, -0.1043328, -0.0525... -31.200773 -61.504528
1 Xscape One Of Those Love Songs /x/xscape/one+of+those+love+songs_20147669.html When you're far from me there's a melody \nTh... [[-0.28554463, 0.07458883, -0.04793184, -0.022... 46.022362 29.905914
2 Kirk Franklin Better /k/kirk+franklin/better_20370473.html If I could I, I'd get away \nFar from all thi... [[-0.23288684, 0.040610936, -0.050898734, -0.0... -40.771591 35.187023
3 HIM Fade Into You /h/him/fade+into+you_20626473.html I want to hold the hand inside you \nI want t... [[-0.27420214, 0.06918051, -0.009710236, -0.02... -5.976742 71.331032
4 Queen I'm In Love With My Car /q/queen/im+in+love+with+my+car_20112603.html Oooh \nThe machine of a dream, such a clean m... [[-0.280056, 0.05401318, -0.07008333, -0.03232... -10.808808 -51.315109
In [45]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_songs['Y'],
    x = two_dimensional_songs['X'],
    text = two_dimensional_songs['song'],
    mode='markers',
    marker=dict(
        size=5,
        color=np.random.randn(len(two_dimensional_songs)), # one random color value per point
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]

iplot(data)
In [46]:
###plot cluster by ARTIST

print(type(two_dimensional_songs))

import plotly.express as px
fig = px.scatter(two_dimensional_songs, x='X', y='Y',color='artist')
fig.show()
<class 'pandas.core.frame.DataFrame'>

Next I will try another corpus. In this case I selected the IMDB reviews file.

In [47]:
reviews = pd.read_csv("IMDB Dataset.csv", header=0)
#songs.head()
reviews.head()
Out[47]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
In [48]:
text_corpus_IMBD = []
for review in reviews['review']:
    words_review = review.lower().split()
    text_corpus_IMBD.append(words_review)



# Dimensionality of the resulting word vectors:
# more dimensions capture more nuance but are more
# computationally expensive to train.
num_features = 50
# Minimum word count threshold.
min_word_count = 1

# Number of threads to run in parallel:
# more workers means faster training.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7


# Downsampling rate for frequent words.
downsampling = 1e-1

# Seed for the RNG, to make the results reproducible.
#random number generator
#deterministic, good for debugging
seed = 1

reviews2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

reviews2vec.build_vocab(text_corpus_IMBD)
print (len(text_corpus_IMBD))
50000
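The context_size parameter above determines which (center, context) pairs a skip-gram model trains on: every word within the window around a center word becomes a training target for it. A minimal sketch of pair generation (a simplification for illustration, not gensim's actual implementation):

```python
def context_pairs(tokens, window):
    # generate (center, context) training pairs as in skip-gram
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(context_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```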
In [49]:
import time
start_time = time.time()



reviews2vec.train(text_corpus_IMBD, total_examples=reviews2vec.corpus_count, epochs=2)

if not os.path.exists("trained"):
    os.makedirs("trained")

reviews2vec.save(os.path.join("trained", "reviews2vectors.w2v"))

print("--- %s seconds ---" % (time.time() - start_time))
--- 91.64017701148987 seconds ---
In [50]:
reviews2vec = w2v.Word2Vec.load(os.path.join("trained", "reviews2vectors.w2v"))
In [51]:
reviews2vec.wv.most_similar("amazing")
Out[51]:
[('incredible', 0.9464342594146729),
 ('awesome', 0.9425973892211914),
 ('excellent', 0.9401889443397522),
 ('brilliant', 0.9375620484352112),
 ('fantastic', 0.9257223010063171),
 ('exceptional', 0.9189038276672363),
 ('outstanding', 0.9088938236236572),
 ('inspiring', 0.8922288417816162),
 ('magnificent', 0.8908681869506836),
 ('extraordinary', 0.8895775079727173)]
In [52]:
reviews2vec.wv.most_similar("thrilling")
Out[52]:
[('suspenseful', 0.9628487825393677),
 ('well-crafted', 0.9591447114944458),
 ('light-hearted', 0.9501767158508301),
 ('gripping', 0.9493370652198792),
 ('fast-paced', 0.9472066760063171),
 ('chilling', 0.945007860660553),
 ('action-packed', 0.9445863962173462),
 ('well-acted', 0.9370169043540955),
 ('overlong', 0.9359961152076721),
 ('upbeat', 0.9346778392791748)]
In [53]:
reviews2vec.wv.most_similar("strange")
Out[53]:
[('weird', 0.9504178762435913),
 ('neat', 0.9211158752441406),
 ('bizarre', 0.9136784076690674),
 ('spooky', 0.9126198291778564),
 ('creepy', 0.9111205339431763),
 ('rough', 0.9034830331802368),
 ('strange,', 0.9016885757446289),
 ('magical', 0.8997852802276611),
 ('startling', 0.8954753279685974),
 ('sensual', 0.8952290415763855)]
In [54]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = reviews2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{0} is to {1} as {2} is to {3}".format(start1, end1, start2, end2))
In [55]:
nearest_similarity_cosmul("funny", "comedy", "horror")
funny is to comedy as scary is to horror
In [56]:
nearest_similarity_cosmul("thrilling", "thriller", "suspense")
thrilling is to thriller as excitement is to suspense

We try a few more of the model's functions, such as the closeness check: closer_than returns the words in the corpus that are closer to "bad" than "good" is.
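The idea behind closer_than can be illustrated with plain cosine similarity on toy vectors (the vectors below are made up for illustration, not taken from the trained model):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 2-D "embeddings" standing in for the real word vectors
vec = {
    "bad":      np.array([1.0, 0.0]),
    "terrible": np.array([0.9, 0.1]),
    "good":     np.array([0.0, 1.0]),
}

# words closer to "bad" than "good" is
threshold = cosine(vec["bad"], vec["good"])
closer = [w for w in vec if w not in ("bad", "good")
          and cosine(vec["bad"], vec[w]) > threshold]
print(closer)  # ['terrible']
```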

In [57]:
reviews2vec.wv.closer_than("bad", "good")
Out[57]:
['stupid', 'terrible', 'awful', 'horrible', 'lame', 'dumb', 'crappy', 'lousy']
In [58]:
reviews2vec.wv.closer_than("interesting", "thrilling")
Out[58]:
['entertaining',
 'important',
 'enjoyable',
 'realistic',
 'odd',
 'clever',
 'effective',
 'exciting',
 'amusing',
 'complex',
 'interesting.',
 'fascinating',
 'interesting,',
 'compelling',
 'disappointing',
 'unusual',
 'surprising',
 'intriguing',
 'engaging',
 'humorous',
 'unexpected',
 'extraordinary',
 'appropriate',
 'satisfying',
 'appealing',
 'suspenseful',
 'frightening',
 'worthwhile',
 'horrific',
 'comical',
 'ironic',
 'authentic',
 'gripping',
 'imaginative',
 'relevant',
 'fitting',
 'detailed',
 'daring',
 'acceptable',
 'potentially',
 'conventional',
 'different,',
 'terrifying',
 'unpleasant',
 'new,',
 'inspiring',
 'implausible',
 'innovative',
 'disjointed',
 'consistent',
 'thoughtful',
 'frustrating',
 'challenging',
 'adequate',
 'captivating',
 'educational',
 'additional',
 'admittedly',
 'plausible',
 'astonishing',
 'altogether',
 'special,',
 'thought-provoking',
 'complex,',
 'important,',
 'light-hearted',
 'insightful',
 'overlong',
 'powerful,',
 'effective.',
 'satirical',
 'uplifting',
 'startling',
 'edgy',
 'intricate',
 'pleasing',
 'tiresome',
 'engrossing',
 'stale',
 'incomprehensible',
 'memorable,',
 'filler',
 'unique,',
 'provocative',
 'unreal',
 'well-written',
 'noteworthy',
 'jarring',
 'weaker',
 'balanced',
 'rewarding',
 'effective,',
 'trivial',
 'unsatisfying',
 'affecting',
 'compelling,',
 'fascinating.',
 'informative',
 'abstract',
 'apt',
 'absorbing',
 'intriguing.',
 'sensational',
 'upbeat',
 'astounding',
 'intense,',
 'cohesive',
 'relaxed',
 'arc',
 'well-done',
 'outlandish',
 'hopeful',
 'disappointing,',
 'linear',
 'fascinating,',
 'humorous,',
 'engaging.',
 'visceral',
 'marvellous',
 'underdeveloped',
 'structured',
 'compelling.',
 'integral',
 'tricky',
 'fresh.',
 'puzzling',
 'efficient',
 'heart-warming',
 'enthralling',
 'gripping,',
 'stimulating',
 'layered',
 'arbitrary',
 'unnerving',
 'throwaway',
 'frightening.',
 'action-packed',
 'uninspiring',
 'distasteful',
 'in-depth',
 'satisfactory',
 'truthful',
 'well-crafted',
 'slow-moving',
 'inconsequential',
 'incomplete',
 'topical',
 'timely',
 'character-driven',
 'atypical',
 'exhilarating',
 'genuine,',
 'unexciting',
 'disconcerting',
 'bearable',
 'eerie,',
 'palatable',
 'above-average',
 'peripheral',
 'satisfying,',
 'maudlin',
 'telling,',
 'dream-like',
 'enlightening',
 'underwhelming',
 'emotive',
 'unimportant',
 'pertinent']

The next function, rank, returns the position of the second word in the ranked list of words most similar to the first.

In [59]:
reviews2vec.wv.rank("funny", "sad")
Out[59]:
241
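A rough sketch of what rank computes, using made-up 2-D vectors in place of the trained embeddings:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical toy vocabulary
vocab = {
    "funny":     np.array([1.0, 0.0]),
    "hilarious": np.array([0.95, 0.05]),
    "witty":     np.array([0.8, 0.2]),
    "sad":       np.array([0.1, 0.9]),
}

def rank(w1, w2):
    # 1-based position of w2 when the rest of the vocabulary
    # is sorted by similarity to w1
    sims = sorted((w for w in vocab if w != w1),
                  key=lambda w: cosine(vocab[w1], vocab[w]),
                  reverse=True)
    return sims.index(w2) + 1

print(rank("funny", "sad"))  # 3
```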

I compute the normalized sum vector of each review.
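The per-review vector built in the next cell is the element-wise sum of the word vectors, L2-normalized so that every review lies on the unit sphere. A minimal numpy sketch with made-up 3-dimensional word vectors:

```python
import numpy as np

# hypothetical 3-D word vectors for a two-word review
word_vecs = [np.array([1.0, 2.0, 2.0]), np.array([2.0, 0.0, 1.0])]

vector_sum = np.sum(word_vecs, axis=0)               # [3., 2., 3.]
normalised = vector_sum / np.linalg.norm(vector_sum)  # unit length

print(np.linalg.norm(normalised))
```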

In [60]:
import sklearn.preprocessing

print(reviews2vec.wv['thrilling'])

def reviewVector(row):
    # sum the vectors of every word in the review, then L2-normalize
    vector_sum = 0
    words = row.lower().split()
    for word in words:
        vector_sum = vector_sum + reviews2vec.wv[word]
    vector_sum = vector_sum.reshape(1, -1)
    normalised_vector_sum = sklearn.preprocessing.normalize(vector_sum)
    return normalised_vector_sum


import time
start_time = time.time()

reviews['review_vector'] = reviews['review'].apply(reviewVector)
[ 1.1408024e-01  1.0028249e+00 -4.0776709e-01  6.3194376e-01
 -5.3868401e-01  5.7180923e-01 -6.2420219e-01 -7.7755105e-01
  4.2154270e-01  3.3668098e-01  2.4965930e-01 -2.2757605e-01
  4.4364333e-01 -1.4269112e-01  1.5760416e-01 -7.1902841e-01
  1.0977198e+00  4.1651750e-01 -6.4244765e-01 -4.7974402e-01
 -2.4507371e-01  8.8866335e-01 -1.1247448e-01  3.0498073e-01
  4.4057676e-01 -5.3621702e-02  4.6920013e-02  5.1340193e-01
 -1.4576937e-01 -7.3556811e-01  3.5738945e-01 -2.3437326e-01
 -1.1153664e-01 -3.3719498e-01 -1.3971601e-01 -2.3480666e-01
  8.0455583e-01 -6.8233728e-02  1.5279391e-01  2.3973991e-01
 -6.2576985e-01  2.0438588e-01  6.3482136e-01 -3.9912593e-01
  1.5888073e-01  1.1927926e-01 -2.1548657e-01  3.4055001e-01
 -5.2444357e-04 -5.4270124e-01]

t-SNE and random review selection

Each review vector has 50 dimensions. Applying t-SNE is memory-intensive, so it is easier on the machine to use a random sample of the 50,000 reviews.

In [61]:
review_vectors = []
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size = 0.9)


for review_vector in train['review_vector']:
    review_vectors.append(review_vector)

train.head(10)
Out[61]:
review sentiment review_vector
37715 I have NEVER fallen asleep whilst watching a m... negative [[-0.076031215, 0.29945657, -0.11639837, 0.108...
16578 My goodness. This movie really really shows th... positive [[-0.046262905, 0.27993667, -0.08239518, 0.129...
10153 what ever you do do not waste your time on thi... negative [[-0.02881998, 0.28027973, -0.08297233, 0.1324...
32706 For me an unsatisfactory, unconvincing heist m... negative [[-0.04048575, 0.27070275, -0.04973773, 0.1353...
37186 When is ART going to overcome racism? I believ... positive [[-0.049328458, 0.2931728, -0.05681305, 0.1535...
44114 This is loosely based on the ideas of the orig... negative [[-0.03430819, 0.26674643, -0.05739421, 0.1419...
26494 I began watching this movie on t.v. some weeks... negative [[-0.042337447, 0.27033776, -0.051045462, 0.14...
39151 I just watched the movie tonight and i found i... positive [[-0.042164814, 0.26295108, -0.09086331, 0.122...
44517 Clifton Webb as "Mr. Scoutmaster" is one of th... positive [[-0.055760674, 0.28168395, -0.091085255, 0.12...
24149 I had been avoiding this movie for sometime...... negative [[-0.03959422, 0.28287077, -0.082041875, 0.128...

I started with the same parameter values as in the previous example and adjusted them to obtain better results. Some example results are shown below.
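A note on the reshape in the next cell: reviewVector returns arrays of shape (1, 50), so stacking 5000 of them with np.array yields shape (5000, 1, 50), and the reshape drops the singleton axis to give t-SNE the (samples, features) matrix it expects. A tiny sketch with hypothetical 3-dimensional vectors:

```python
import numpy as np

# hypothetical: 4 review vectors, each of shape (1, 3) as returned by reviewVector
review_vectors = [np.random.randn(1, 3) for _ in range(4)]

X = np.array(review_vectors)   # shape (4, 1, 3)
X = X.reshape((4, 3))          # drop the singleton axis for t-SNE

print(X.shape)  # (4, 3)
```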

In [62]:
X = np.array(review_vectors).reshape((5000, 50))

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=10000, random_state=0, verbose=2, learning_rate=700)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5000 samples in 0.007s...
[t-SNE] Computed neighbors for 5000 samples in 3.758s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5000
[t-SNE] Computed conditional probabilities for sample 2000 / 5000
[t-SNE] Computed conditional probabilities for sample 3000 / 5000
[t-SNE] Computed conditional probabilities for sample 4000 / 5000
[t-SNE] Computed conditional probabilities for sample 5000 / 5000
[t-SNE] Mean sigma: 0.024767
[t-SNE] Computed conditional probabilities in 0.260s
[t-SNE] Iteration 50: error = 90.7029266, gradient norm = 0.1366012 (50 iterations in 4.855s)
[t-SNE] Iteration 100: error = 91.4865189, gradient norm = 0.1249594 (50 iterations in 5.238s)
[t-SNE] Iteration 150: error = 90.7648392, gradient norm = 0.1250972 (50 iterations in 5.123s)
[t-SNE] Iteration 200: error = 91.2460556, gradient norm = 0.1288815 (50 iterations in 5.182s)
[t-SNE] Iteration 250: error = 91.4774170, gradient norm = 0.1275273 (50 iterations in 5.406s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 91.477417
[t-SNE] Iteration 300: error = 2.9247789, gradient norm = 0.0017565 (50 iterations in 4.692s)
[t-SNE] Iteration 350: error = 2.7835729, gradient norm = 0.0007194 (50 iterations in 3.585s)
[t-SNE] ... (iterations 400-3450 trimmed; the error decreases steadily toward 2.61) ...
[t-SNE] Iteration 3500: error = 2.6134551, gradient norm = 0.0000097 (50 iterations in 3.431s)
[t-SNE] Iteration 3550: error = 2.6134074, gradient norm = 0.0000095 (50 iterations in 3.395s)
[t-SNE] Iteration 3600: error = 2.6133907, gradient norm = 0.0000074 (50 iterations in 3.405s)
[t-SNE] Iteration 3650: error = 2.6133113, gradient norm = 0.0000088 (50 iterations in 3.387s)
[t-SNE] Iteration 3700: error = 2.6132772, gradient norm = 0.0000093 (50 iterations in 3.388s)
[t-SNE] Iteration 3750: error = 2.6132085, gradient norm = 0.0000081 (50 iterations in 3.398s)
[t-SNE] Iteration 3800: error = 2.6131518, gradient norm = 0.0000082 (50 iterations in 3.436s)
[t-SNE] Iteration 3850: error = 2.6131055, gradient norm = 0.0000092 (50 iterations in 3.505s)
[t-SNE] Iteration 3900: error = 2.6131043, gradient norm = 0.0000071 (50 iterations in 3.756s)
[t-SNE] Iteration 3950: error = 2.6130631, gradient norm = 0.0000089 (50 iterations in 3.571s)
[t-SNE] Iteration 4000: error = 2.6130116, gradient norm = 0.0000075 (50 iterations in 3.647s)
[t-SNE] Iteration 4050: error = 2.6130016, gradient norm = 0.0000065 (50 iterations in 3.525s)
[t-SNE] Iteration 4100: error = 2.6129427, gradient norm = 0.0000068 (50 iterations in 3.506s)
[t-SNE] Iteration 4150: error = 2.6128845, gradient norm = 0.0000073 (50 iterations in 3.471s)
[t-SNE] Iteration 4200: error = 2.6128416, gradient norm = 0.0000066 (50 iterations in 3.483s)
[t-SNE] Iteration 4250: error = 2.6128223, gradient norm = 0.0000079 (50 iterations in 3.468s)
[t-SNE] Iteration 4300: error = 2.6127911, gradient norm = 0.0000069 (50 iterations in 3.561s)
[t-SNE] Iteration 4350: error = 2.6127396, gradient norm = 0.0000082 (50 iterations in 3.553s)
[t-SNE] Iteration 4400: error = 2.6127059, gradient norm = 0.0000072 (50 iterations in 3.502s)
[t-SNE] Iteration 4450: error = 2.6126354, gradient norm = 0.0000085 (50 iterations in 3.471s)
[t-SNE] Iteration 4500: error = 2.6126363, gradient norm = 0.0000072 (50 iterations in 3.482s)
[t-SNE] Iteration 4550: error = 2.6126268, gradient norm = 0.0000078 (50 iterations in 3.475s)
[t-SNE] Iteration 4600: error = 2.6125979, gradient norm = 0.0000072 (50 iterations in 3.500s)
[t-SNE] Iteration 4650: error = 2.6125610, gradient norm = 0.0000081 (50 iterations in 3.471s)
[t-SNE] Iteration 4700: error = 2.6125557, gradient norm = 0.0000092 (50 iterations in 3.569s)
[t-SNE] Iteration 4750: error = 2.6125157, gradient norm = 0.0000078 (50 iterations in 3.487s)
[t-SNE] Iteration 4800: error = 2.6124704, gradient norm = 0.0000080 (50 iterations in 3.495s)
[t-SNE] Iteration 4850: error = 2.6124458, gradient norm = 0.0000078 (50 iterations in 3.538s)
[t-SNE] Iteration 4900: error = 2.6124282, gradient norm = 0.0000075 (50 iterations in 3.643s)
[t-SNE] Iteration 4950: error = 2.6123879, gradient norm = 0.0000092 (50 iterations in 3.526s)
[t-SNE] Iteration 5000: error = 2.6123588, gradient norm = 0.0000081 (50 iterations in 3.597s)
[t-SNE] Iteration 5050: error = 2.6123402, gradient norm = 0.0000081 (50 iterations in 3.468s)
[t-SNE] Iteration 5100: error = 2.6123116, gradient norm = 0.0000078 (50 iterations in 3.484s)
[t-SNE] Iteration 5150: error = 2.6122804, gradient norm = 0.0000068 (50 iterations in 3.492s)
[t-SNE] Iteration 5200: error = 2.6122808, gradient norm = 0.0000084 (50 iterations in 3.478s)
[t-SNE] Iteration 5250: error = 2.6122613, gradient norm = 0.0000080 (50 iterations in 3.510s)
[t-SNE] Iteration 5300: error = 2.6122193, gradient norm = 0.0000079 (50 iterations in 3.551s)
[t-SNE] Iteration 5350: error = 2.6122150, gradient norm = 0.0000068 (50 iterations in 3.489s)
[t-SNE] Iteration 5400: error = 2.6121943, gradient norm = 0.0000080 (50 iterations in 3.481s)
[t-SNE] Iteration 5450: error = 2.6121180, gradient norm = 0.0000080 (50 iterations in 3.502s)
[t-SNE] Iteration 5500: error = 2.6121755, gradient norm = 0.0000062 (50 iterations in 3.488s)
[t-SNE] Iteration 5550: error = 2.6121349, gradient norm = 0.0000067 (50 iterations in 3.470s)
[t-SNE] Iteration 5600: error = 2.6121101, gradient norm = 0.0000081 (50 iterations in 3.481s)
[t-SNE] Iteration 5650: error = 2.6121078, gradient norm = 0.0000089 (50 iterations in 3.459s)
[t-SNE] Iteration 5700: error = 2.6120820, gradient norm = 0.0000080 (50 iterations in 3.543s)
[t-SNE] Iteration 5750: error = 2.6120768, gradient norm = 0.0000076 (50 iterations in 3.611s)
[t-SNE] Iteration 5800: error = 2.6120374, gradient norm = 0.0000087 (50 iterations in 3.549s)
[t-SNE] Iteration 5850: error = 2.6120329, gradient norm = 0.0000078 (50 iterations in 3.507s)
[t-SNE] Iteration 5900: error = 2.6120236, gradient norm = 0.0000075 (50 iterations in 3.581s)
[t-SNE] Iteration 5950: error = 2.6120083, gradient norm = 0.0000079 (50 iterations in 3.564s)
[t-SNE] Iteration 6000: error = 2.6119668, gradient norm = 0.0000073 (50 iterations in 3.483s)
[t-SNE] Iteration 6050: error = 2.6119401, gradient norm = 0.0000081 (50 iterations in 3.499s)
[t-SNE] Iteration 6100: error = 2.6119602, gradient norm = 0.0000078 (50 iterations in 3.523s)
[t-SNE] Iteration 6150: error = 2.6119411, gradient norm = 0.0000070 (50 iterations in 3.504s)
[t-SNE] Iteration 6200: error = 2.6119177, gradient norm = 0.0000077 (50 iterations in 3.475s)
[t-SNE] Iteration 6250: error = 2.6118889, gradient norm = 0.0000089 (50 iterations in 3.468s)
[t-SNE] Iteration 6300: error = 2.6118712, gradient norm = 0.0000086 (50 iterations in 3.481s)
[t-SNE] Iteration 6350: error = 2.6118581, gradient norm = 0.0000086 (50 iterations in 3.495s)
[t-SNE] Iteration 6400: error = 2.6118402, gradient norm = 0.0000068 (50 iterations in 3.498s)
[t-SNE] Iteration 6450: error = 2.6118324, gradient norm = 0.0000078 (50 iterations in 3.481s)
[t-SNE] Iteration 6500: error = 2.6118484, gradient norm = 0.0000083 (50 iterations in 3.518s)
[t-SNE] Iteration 6550: error = 2.6118205, gradient norm = 0.0000092 (50 iterations in 3.494s)
[t-SNE] Iteration 6600: error = 2.6117935, gradient norm = 0.0000093 (50 iterations in 3.487s)
[t-SNE] Iteration 6650: error = 2.6117887, gradient norm = 0.0000092 (50 iterations in 3.500s)
[t-SNE] Iteration 6700: error = 2.6117766, gradient norm = 0.0000087 (50 iterations in 3.478s)
[t-SNE] Iteration 6750: error = 2.6117654, gradient norm = 0.0000091 (50 iterations in 3.639s)
[t-SNE] Iteration 6800: error = 2.6117547, gradient norm = 0.0000086 (50 iterations in 3.556s)
[t-SNE] Iteration 6850: error = 2.6117532, gradient norm = 0.0000085 (50 iterations in 3.478s)
[t-SNE] Iteration 6900: error = 2.6117492, gradient norm = 0.0000088 (50 iterations in 3.482s)
[t-SNE] Iteration 6950: error = 2.6117291, gradient norm = 0.0000082 (50 iterations in 3.550s)
[t-SNE] Iteration 7000: error = 2.6117034, gradient norm = 0.0000088 (50 iterations in 3.556s)
[t-SNE] Iteration 7050: error = 2.6116951, gradient norm = 0.0000088 (50 iterations in 3.512s)
[t-SNE] Iteration 7100: error = 2.6116517, gradient norm = 0.0000067 (50 iterations in 3.481s)
[t-SNE] Iteration 7150: error = 2.6116700, gradient norm = 0.0000091 (50 iterations in 3.477s)
[t-SNE] Iteration 7200: error = 2.6116431, gradient norm = 0.0000087 (50 iterations in 3.480s)
[t-SNE] Iteration 7250: error = 2.6116118, gradient norm = 0.0000088 (50 iterations in 3.481s)
[t-SNE] Iteration 7300: error = 2.6116214, gradient norm = 0.0000070 (50 iterations in 3.489s)
[t-SNE] Iteration 7350: error = 2.6115913, gradient norm = 0.0000076 (50 iterations in 3.506s)
[t-SNE] Iteration 7400: error = 2.6115825, gradient norm = 0.0000075 (50 iterations in 3.524s)
[t-SNE] Iteration 7450: error = 2.6115623, gradient norm = 0.0000083 (50 iterations in 3.491s)
[t-SNE] Iteration 7500: error = 2.6115682, gradient norm = 0.0000084 (50 iterations in 3.514s)
[t-SNE] Iteration 7550: error = 2.6115353, gradient norm = 0.0000083 (50 iterations in 3.484s)
[t-SNE] Iteration 7600: error = 2.6115427, gradient norm = 0.0000078 (50 iterations in 3.478s)
[t-SNE] Iteration 7650: error = 2.6115592, gradient norm = 0.0000075 (50 iterations in 3.479s)
[t-SNE] Iteration 7700: error = 2.6115322, gradient norm = 0.0000082 (50 iterations in 3.531s)
[t-SNE] Iteration 7750: error = 2.6114593, gradient norm = 0.0000072 (50 iterations in 3.615s)
[t-SNE] Iteration 7800: error = 2.6115143, gradient norm = 0.0000089 (50 iterations in 3.547s)
[t-SNE] Iteration 7850: error = 2.6114869, gradient norm = 0.0000071 (50 iterations in 3.494s)
[t-SNE] Iteration 7900: error = 2.6114852, gradient norm = 0.0000081 (50 iterations in 3.558s)
[t-SNE] Iteration 7950: error = 2.6114788, gradient norm = 0.0000075 (50 iterations in 3.492s)
[t-SNE] Iteration 8000: error = 2.6114829, gradient norm = 0.0000094 (50 iterations in 3.467s)
[t-SNE] Iteration 8050: error = 2.6114483, gradient norm = 0.0000082 (50 iterations in 3.488s)
[t-SNE] Iteration 8100: error = 2.6114318, gradient norm = 0.0000084 (50 iterations in 3.476s)
[t-SNE] Iteration 8150: error = 2.6114428, gradient norm = 0.0000083 (50 iterations in 3.518s)
[t-SNE] Iteration 8200: error = 2.6114237, gradient norm = 0.0000069 (50 iterations in 3.496s)
[t-SNE] Iteration 8250: error = 2.6114161, gradient norm = 0.0000077 (50 iterations in 3.531s)
[t-SNE] Iteration 8300: error = 2.6114123, gradient norm = 0.0000088 (50 iterations in 3.160s)
[t-SNE] Iteration 8350: error = 2.6114223, gradient norm = 0.0000090 (50 iterations in 2.990s)
[t-SNE] Iteration 8400: error = 2.6113632, gradient norm = 0.0000089 (50 iterations in 3.542s)
[t-SNE] Iteration 8450: error = 2.6113687, gradient norm = 0.0000081 (50 iterations in 3.655s)
[t-SNE] Iteration 8500: error = 2.6113811, gradient norm = 0.0000078 (50 iterations in 3.621s)
[t-SNE] Iteration 8550: error = 2.6113777, gradient norm = 0.0000101 (50 iterations in 3.474s)
[t-SNE] Iteration 8600: error = 2.6113701, gradient norm = 0.0000081 (50 iterations in 3.501s)
[t-SNE] Iteration 8650: error = 2.6113439, gradient norm = 0.0000079 (50 iterations in 3.469s)
[t-SNE] Iteration 8700: error = 2.6113458, gradient norm = 0.0000087 (50 iterations in 3.498s)
[t-SNE] Iteration 8750: error = 2.6113346, gradient norm = 0.0000075 (50 iterations in 3.488s)
[t-SNE] Iteration 8800: error = 2.6113205, gradient norm = 0.0000080 (50 iterations in 3.478s)
[t-SNE] Iteration 8850: error = 2.6113229, gradient norm = 0.0000066 (50 iterations in 3.474s)
[t-SNE] Iteration 8900: error = 2.6112950, gradient norm = 0.0000080 (50 iterations in 3.495s)
[t-SNE] Iteration 8950: error = 2.6112802, gradient norm = 0.0000075 (50 iterations in 3.492s)
[t-SNE] Iteration 9000: error = 2.6112652, gradient norm = 0.0000069 (50 iterations in 3.491s)
[t-SNE] Iteration 9050: error = 2.6112688, gradient norm = 0.0000076 (50 iterations in 3.497s)
[t-SNE] Iteration 9100: error = 2.6112738, gradient norm = 0.0000078 (50 iterations in 3.511s)
[t-SNE] Iteration 9150: error = 2.6112757, gradient norm = 0.0000068 (50 iterations in 3.522s)
[t-SNE] Iteration 9200: error = 2.6112676, gradient norm = 0.0000079 (50 iterations in 3.587s)
[t-SNE] Iteration 9250: error = 2.6112599, gradient norm = 0.0000085 (50 iterations in 3.507s)
[t-SNE] Iteration 9300: error = 2.6112590, gradient norm = 0.0000074 (50 iterations in 3.492s)
[t-SNE] Iteration 9350: error = 2.6112378, gradient norm = 0.0000079 (50 iterations in 3.603s)
[t-SNE] Iteration 9400: error = 2.6112494, gradient norm = 0.0000069 (50 iterations in 3.518s)
[t-SNE] Iteration 9450: error = 2.6112316, gradient norm = 0.0000076 (50 iterations in 3.495s)
[t-SNE] Iteration 9500: error = 2.6112294, gradient norm = 0.0000071 (50 iterations in 3.486s)
[t-SNE] Iteration 9550: error = 2.6112261, gradient norm = 0.0000085 (50 iterations in 3.585s)
[t-SNE] Iteration 9600: error = 2.6112161, gradient norm = 0.0000074 (50 iterations in 3.544s)
[t-SNE] Iteration 9650: error = 2.6112001, gradient norm = 0.0000080 (50 iterations in 3.477s)
[t-SNE] Iteration 9700: error = 2.6111925, gradient norm = 0.0000077 (50 iterations in 3.488s)
[t-SNE] Iteration 9750: error = 2.6111760, gradient norm = 0.0000078 (50 iterations in 3.552s)
[t-SNE] Iteration 9800: error = 2.6111801, gradient norm = 0.0000078 (50 iterations in 3.487s)
[t-SNE] Iteration 9850: error = 2.6111116, gradient norm = 0.0000070 (50 iterations in 3.606s)
[t-SNE] Iteration 9900: error = 2.6111605, gradient norm = 0.0000083 (50 iterations in 3.545s)
[t-SNE] Iteration 9950: error = 2.6111493, gradient norm = 0.0000081 (50 iterations in 3.545s)
[t-SNE] Iteration 10000: error = 2.6111314, gradient norm = 0.0000071 (50 iterations in 3.501s)
[t-SNE] KL divergence after 10000 iterations: 2.611131
--- 707.0673179626465 seconds ---
In [63]:
df = pd.DataFrame(all_word_vectors_matrix_2d, columns=['X', 'Y'])

df.head(10)

train.head()

# Reset both indices so pd.concat(axis=1) aligns the rows by position
df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
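The `reset_index(drop=True)` calls matter because `pd.concat(axis=1)` aligns rows by index label, not by position. A minimal sketch with toy frames (not the notebook's data) shows what goes wrong without the reset:

```python
import pandas as pd

# Two frames whose indices do not overlap: concat(axis=1) takes the union
# of the index labels, producing 4 rows padded with NaNs.
left = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'b': [10, 20]}, index=[5, 6])

misaligned = pd.concat([left, right], axis=1)   # 4 rows, NaNs everywhere

# After resetting the index, rows pair up by position as intended.
right = right.reset_index(drop=True)
aligned = pd.concat([left, right], axis=1)      # 2 rows, no NaNs

print(len(misaligned), len(aligned))  # 4 2
```

This is exactly the failure mode the cell above guards against: `train` keeps its original row labels after filtering, while `df` is freshly built with a 0..n index.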
In [64]:
two_dimensional_reviews = pd.concat([train, df], axis=1)

two_dimensional_reviews.head()
Out[64]:
review sentiment review_vector X Y
0 I have NEVER fallen asleep whilst watching a m... negative [[-0.076031215, 0.29945657, -0.11639837, 0.108... -3.319987 -93.460754
1 My goodness. This movie really really shows th... positive [[-0.046262905, 0.27993667, -0.08239518, 0.129... 20.400761 -22.543306
2 what ever you do do not waste your time on thi... negative [[-0.02881998, 0.28027973, -0.08297233, 0.1324... 10.868239 -45.770203
3 For me an unsatisfactory, unconvincing heist m... negative [[-0.04048575, 0.27070275, -0.04973773, 0.1353... -23.306194 60.429176
4 When is ART going to overcome racism? I believ... positive [[-0.049328458, 0.2931728, -0.05681305, 0.1535... -42.044830 8.705990
In [65]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y=two_dimensional_reviews['Y'],
    x=two_dimensional_reviews['X'],
    text=two_dimensional_reviews['review'],
    mode='markers',
    marker=dict(
        size=5,
        # Random values just give each point a distinct color; they do not
        # encode sentiment. Using len() avoids hard-coding the row count.
        color=np.random.randn(len(two_dimensional_reviews)),
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]

iplot(data)
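The scatter above colors points with random noise, so the colors carry no information. A minimal sketch (using a hypothetical miniature stand-in for `two_dimensional_reviews`) of how to encode the sentiment label in the marker color instead:

```python
import pandas as pd

# Hypothetical stand-in for two_dimensional_reviews: map the sentiment
# label to a number so the marker color encodes it directly.
reviews = pd.DataFrame({
    'X': [0.0, 1.0, 2.0],
    'Y': [1.0, 0.0, 2.0],
    'sentiment': ['positive', 'negative', 'positive'],
})
color_values = reviews['sentiment'].map({'negative': 0, 'positive': 1})
print(color_values.tolist())  # [1, 0, 1]
```

Passing `color=color_values` inside `marker=dict(...)` would then tint the two classes differently, which is what the `px.scatter(..., color='sentiment')` call below achieves more directly.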

The reviews are shown below, with color indicating whether their sentiment is positive or negative.

In [66]:
print(type(two_dimensional_reviews))

import plotly.express as px
fig = px.scatter(two_dimensional_reviews, x='X', y='Y',color='sentiment')
fig.show()
<class 'pandas.core.frame.DataFrame'>

Changing the perplexity
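Perplexity is roughly the effective number of neighbors each point considers: t-SNE tunes each point's bandwidth so that the conditional neighbor distribution has perplexity `2 ** H(P)` equal to the chosen value. A small NumPy sketch (an illustration, not part of the t-SNE run below) makes the "effective neighbor count" reading concrete:

```python
import numpy as np

def perplexity(p):
    """Perplexity of a discrete distribution: 2 ** H(p), entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** entropy

# A uniform distribution over 40 neighbors has perplexity 40, matching
# the perplexity=40 setting used in the cell below.
print(perplexity(np.ones(40) / 40))      # ≈ 40.0
# Concentrated mass lowers the effective neighbor count well below 4.
print(perplexity([0.7, 0.1, 0.1, 0.1]))  # ≈ 2.56
```

Larger perplexity therefore pulls in more neighbors per point, emphasizing global structure; smaller values emphasize local clusters.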

In [67]:
X = np.array(review_vectors).reshape((5000, 50))

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=10000, random_state=0, verbose=2, learning_rate=700, perplexity = 40)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 5000 samples in 0.011s...
[t-SNE] Computed neighbors for 5000 samples in 2.690s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5000
[t-SNE] Computed conditional probabilities for sample 2000 / 5000
[t-SNE] Computed conditional probabilities for sample 3000 / 5000
[t-SNE] Computed conditional probabilities for sample 4000 / 5000
[t-SNE] Computed conditional probabilities for sample 5000 / 5000
[t-SNE] Mean sigma: 0.025903
[t-SNE] Computed conditional probabilities in 0.298s
[t-SNE] Iteration 50: error = 86.5489273, gradient norm = 0.1335695 (50 iterations in 4.604s)
[t-SNE] Iteration 100: error = 87.8285294, gradient norm = 0.1181506 (50 iterations in 4.494s)
[t-SNE] Iteration 150: error = 87.4401398, gradient norm = 0.1206046 (50 iterations in 3.761s)
[t-SNE] Iteration 200: error = 87.0727844, gradient norm = 0.1231918 (50 iterations in 5.167s)
[t-SNE] Iteration 250: error = 87.4383469, gradient norm = 0.1219909 (50 iterations in 5.340s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 87.438347
[t-SNE] Iteration 300: error = 2.8622563, gradient norm = 0.0014122 (50 iterations in 4.067s)
[t-SNE] Iteration 350: error = 2.7083178, gradient norm = 0.0007481 (50 iterations in 3.538s)
[t-SNE] Iteration 400: error = 2.6587889, gradient norm = 0.0001594 (50 iterations in 2.964s)
[t-SNE] Iteration 450: error = 2.6345301, gradient norm = 0.0001019 (50 iterations in 3.103s)
[t-SNE] Iteration 500: error = 2.6207860, gradient norm = 0.0000664 (50 iterations in 3.634s)
[t-SNE] Iteration 550: error = 2.6117115, gradient norm = 0.0000550 (50 iterations in 3.647s)
... [t-SNE iterations 600–7300 elided: the error decreases slowly from 2.6059 to 2.5778] ...
[t-SNE] Iteration 7350: error = 2.5777993, gradient norm = 0.0000084 (50 iterations in 3.002s)
[t-SNE] Iteration 7350: did not make any progress during the last 300 episodes. Finished.
[t-SNE] KL divergence after 7350 iterations: 2.577799
--- 547.1324739456177 seconds ---
In [68]:
df = pd.DataFrame(all_word_vectors_matrix_2d, columns=['X', 'Y'])

df.head(10)

train.head()

# Reset both indices so pd.concat(axis=1) aligns the rows by position
df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
In [69]:
two_dimensional_reviews = pd.concat([train, df], axis=1)

two_dimensional_reviews.head()
Out[69]:
review sentiment review_vector X Y
0 I have NEVER fallen asleep whilst watching a m... negative [[-0.076031215, 0.29945657, -0.11639837, 0.108... 65.386780 -29.279146
1 My goodness. This movie really really shows th... positive [[-0.046262905, 0.27993667, -0.08239518, 0.129... 22.779188 3.622681
2 what ever you do do not waste your time on thi... negative [[-0.02881998, 0.28027973, -0.08297233, 0.1324... 24.404718 -9.010635
3 For me an unsatisfactory, unconvincing heist m... negative [[-0.04048575, 0.27070275, -0.04973773, 0.1353... -39.815666 -5.120996
4 When is ART going to overcome racism? I believ... positive [[-0.049328458, 0.2931728, -0.05681305, 0.1535... -14.821279 -30.028585
In [70]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_reviews['Y'],
    x = two_dimensional_reviews['X'],
    text = two_dimensional_reviews['review'],
    mode='markers',
    marker=dict(
        size=5,
        # random colors just to separate points visually;
        # the length must match the number of rows being plotted
        color=np.random.randn(len(two_dimensional_reviews)),
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]
iplot(data)
In [71]:
print(type(two_dimensional_reviews))

import plotly.express as px
fig = px.scatter(two_dimensional_reviews, x='X', y='Y',color='sentiment')
fig.show()
<class 'pandas.core.frame.DataFrame'>
In [72]:
X = np.array(review_vectors).reshape((5000, 50))

start_time = time.time()
tsne = sklearn.manifold.TSNE(n_components=2, n_iter=5000, random_state=0, verbose=2, perplexity = 40)

all_word_vectors_matrix_2d = tsne.fit_transform(X)

print("--- %s seconds ---" % (time.time() - start_time))
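Perplexity is the knob being changed in this rerun: it roughly sets how many neighbors each point tries to preserve, so it should stay well below `n_samples`. A hedged sketch on synthetic data (`X_small` is a made-up stand-in for the (5000, 50) review matrix; `init='pca'` is an optional choice that usually speeds convergence):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X_small = rng.randn(100, 50)  # synthetic stand-in for the review vectors

# Perplexity must be smaller than the number of samples;
# PCA initialization tends to converge faster than the random default
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
emb = tsne.fit_transform(X_small)
assert emb.shape == (100, 2)
```

Rerunning with perplexity 40 instead of the default 30, as this cell does, trades fine local structure for a smoother global layout, which is why the two scatter plots below look different from the first pair.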
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 5000 samples in 0.008s...
[t-SNE] Computed neighbors for 5000 samples in 2.688s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5000
[t-SNE] Computed conditional probabilities for sample 2000 / 5000
[t-SNE] Computed conditional probabilities for sample 3000 / 5000
[t-SNE] Computed conditional probabilities for sample 4000 / 5000
[t-SNE] Computed conditional probabilities for sample 5000 / 5000
[t-SNE] Mean sigma: 0.025903
[t-SNE] Computed conditional probabilities in 0.300s
[t-SNE] Iteration 50: error = 82.7747879, gradient norm = 0.0052280 (50 iterations in 4.849s)
[t-SNE] Iteration 100: error = 82.7188873, gradient norm = 0.0110336 (50 iterations in 3.634s)
[t-SNE] Iteration 150: error = 80.9156799, gradient norm = 0.0070917 (50 iterations in 3.185s)
[t-SNE] Iteration 200: error = 80.9118729, gradient norm = 0.0092043 (50 iterations in 1.896s)
[t-SNE] Iteration 250: error = 80.9136658, gradient norm = 0.0129794 (50 iterations in 2.326s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.913666
[t-SNE] Iteration 300: error = 2.8848817, gradient norm = 0.0011857 (50 iterations in 3.346s)
[t-SNE] Iteration 350: error = 2.7146776, gradient norm = 0.0004104 (50 iterations in 3.433s)
[t-SNE] Iterations 400–4950: error decreased steadily from 2.647 to 2.521 (repetitive log lines condensed)
[t-SNE] Iteration 5000: error = 2.5213058, gradient norm = 0.0000069 (50 iterations in 5.142s)
[t-SNE] KL divergence after 5000 iterations: 2.521306
--- 401.1164610385895 seconds ---
In [73]:
df = pd.DataFrame(all_word_vectors_matrix_2d, columns=['X', 'Y'])

# Reset both indexes again so the concat below pairs rows positionally
df.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
In [74]:
two_dimensional_reviews = pd.concat([train, df], axis=1)

two_dimensional_reviews.head()
Out[74]:
review sentiment review_vector X Y
0 I have NEVER fallen asleep whilst watching a m... negative [[-0.076031215, 0.29945657, -0.11639837, 0.108... 55.727810 5.585767
1 My goodness. This movie really really shows th... positive [[-0.046262905, 0.27993667, -0.08239518, 0.129... 13.465773 -15.574947
2 what ever you do do not waste your time on thi... negative [[-0.02881998, 0.28027973, -0.08297233, 0.1324... 39.684711 -10.038692
3 For me an unsatisfactory, unconvincing heist m... negative [[-0.04048575, 0.27070275, -0.04973773, 0.1353... -42.544025 16.781073
4 When is ART going to overcome racism? I believ... positive [[-0.049328458, 0.2931728, -0.05681305, 0.1535... -0.656627 -2.017090
In [75]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_reviews['Y'],
    x = two_dimensional_reviews['X'],
    text = two_dimensional_reviews['review'],
    mode='markers',
    marker=dict(
        size=5,
        color=two_dimensional_reviews['X'],  # color by X coordinate
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]
iplot(data)
In [76]:
print(type(two_dimensional_reviews))

import plotly.express as px
fig = px.scatter(two_dimensional_reviews, x='X', y='Y',color='sentiment')
fig.show()
<class 'pandas.core.frame.DataFrame'>
In [77]:
two_dimensional_reviews["sentiment"] = two_dimensional_reviews["sentiment"].astype('category')
In [78]:
two_dimensional_reviews["sentiment_cat"] = two_dimensional_reviews["sentiment"].cat.codes
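Casting to `category` and taking `cat.codes` turns the string labels into the integers a numeric colorscale like Viridis expects. A small sketch of the encoding:

```python
import pandas as pd

s = pd.Series(['negative', 'positive', 'negative']).astype('category')
codes = s.cat.codes  # integers assigned in sorted category order

assert list(s.cat.categories) == ['negative', 'positive']
assert list(codes) == [0, 1, 0]
```

For a two-class column this maps 'negative' to 0 and 'positive' to 1, so the scatter plot's color bar reads directly as sentiment.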
In [79]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

import plotly.graph_objs as go

trace1 = go.Scatter(
    y = two_dimensional_reviews['Y'],
    x = two_dimensional_reviews['X'],
    text = two_dimensional_reviews['review'],
    mode='markers',
    marker=dict(
        size=5,
        color=two_dimensional_reviews['sentiment_cat'],  # color by sentiment code
        colorscale='Viridis',
        showscale=True
    )
)
data = [trace1]
iplot(data)
In [ ]: